NUTCH-3162 Latency metrics to properly merge data from all threads and tasks #906
lewismc wants to merge 6 commits into apache:master
Conversation
Refactored tests and introduced
…d tasks Wrap ParseImpl and BytesWritable holding the serialized T-Digest into a NutchWritable object.
Hi @lewismc,
After fixing the ParseSegment job in f1d5dc3, testing of the various jobs was successful, except for the IndexingJob, which fails when merging the latency metrics; see the inline comments.
I'll continue tomorrow:
- fix the unit tests for the parse job,
- verify the latency metrics job counters in the log files,
- have a closer look into the IndexingJob.
if (fs.exists(latencyDir)) {
  try (Job mergeJob = IndexerMapReduce.createLatencyMergeJob(conf, latencyDir)) {
    FileOutputFormat.setOutputPath(mergeJob, new Path(tmp, "_latency_merge_out"));
    boolean mergeSuccess = mergeJob.waitForCompletion(true);
2026-04-08 21:38:08,406 ERROR o.a.n.i.IndexingJob [main] Indexer: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/mnt/data/wastl/proj/crawler/nutch/test/tmp_1775677086987-2006747810/_latency
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:342)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:281)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:445)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:311)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:328)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:201)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1677)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1674)
at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1674)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1695)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:167)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:320)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:329)
when running
bin/nutch index -Dplugin.includes='indexer-dummy|index-(basic|more)' -nocrawldb /path/to/segments/20260408205641/
Or when running on a single-node cluster (indexing to Solr):
2026-04-08 21:50:01,953 ERROR indexer.IndexingJob: Indexer: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:9000/user/wastl/tmp_1775677717095--1619189953/_latency
This only affects the latency-merge job.
Looked into SequenceFileInputFormat.listStatus: it takes either a single file or a directory tree containing sequence files at _latency/part-xxx/data.
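A way around the layout expected by `SequenceFileInputFormat.listStatus` is to enumerate the nested `data` files under `part-*` subdirectories explicitly and hand those to the merge job as input paths, skipping the merge job entirely when nothing was written. The following is a hypothetical, self-contained sketch using `java.nio.file` (the real code would use the Hadoop `FileSystem` API; the class and method names are illustrative, not Nutch APIs):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class LatencyInputCollector {

  // Return every "data" file found directly below a part-* subdirectory
  // of the given _latency directory. An empty result signals that the
  // merge job can be skipped, avoiding the InvalidInputException above.
  static List<Path> collectDataFiles(Path latencyDir) throws IOException {
    if (!Files.isDirectory(latencyDir)) {
      return Collections.emptyList();
    }
    try (Stream<Path> walk = Files.walk(latencyDir, 2)) {
      return walk
          .filter(p -> p.getFileName().toString().equals("data"))
          .filter(p -> p.getParent() != null
              && p.getParent().getFileName().toString().startsWith("part-"))
          .sorted()
          .collect(Collectors.toList());
    }
  }
}
```

With the Hadoop API, each collected path would then be passed to `FileInputFormat.addInputPath` on the merge job instead of the bare `_latency` directory.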
Yes, I wondered about this. I am not a huge fan of the intermediate output being written for the IndexingJob either. I think we could even remove the changes for this job and address them separately. This will NOT have an impact on the job execution; however, the counters are not accurate.
…d tasks Fix unit test.
sebastian-nagel left a comment:
Hi @lewismc,
TestParseSegment is fixed.
I've verified that the latency metrics are plausible for all jobs, except the indexer.
So my +1 for all parts except the indexer, so far.
For the indexer, I'm not 100% convinced regarding the solution with a temporary output folder.
We also had this issue with indexers writing files (indexer-csv, see NUTCH-2793 / PR #534): it might be an improvement to give indexer plugins the possibility to write data to an index output directory. Because multiple indexers can potentially be used in the same job, we would need subdirectories anyway. This would also allow one for the latency metrics, and would be more transparent.
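The suggested layout could look something like the sketch below: one shared index output directory, one subdirectory per active indexer plugin, and a reserved subdirectory for the merged latency metrics. This is a hedged illustration of the idea only; the class, method, and the `_latency` subdirectory name are assumptions, not existing Nutch code.

```java
import java.nio.file.*;
import java.util.*;

public class IndexOutputLayout {

  // Map each active indexer plugin to its own subdirectory under the
  // shared index output directory, plus a reserved "_latency"
  // subdirectory for the merged latency metrics.
  static Map<String, Path> layout(Path indexOutDir, List<String> plugins) {
    Map<String, Path> dirs = new LinkedHashMap<>();
    for (String plugin : plugins) {
      dirs.put(plugin, indexOutDir.resolve(plugin));
    }
    dirs.put("_latency", indexOutDir.resolve("_latency"));
    return dirs;
  }
}
```

Per-plugin subdirectories keep concurrently used indexers (e.g. indexer-csv alongside another writer) from clobbering each other's files, and the latency merge output gets a stable, discoverable location for free.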
The temp Indexer output isn't ideal. Thank you for reviewing. This "Indexer Latency Merge" is one solution; I will follow up with alternative(s) soon. The best current documentation is our Javadoc. I think we can do better than Javadoc, and I am thinking about how we package better documentation.


PR for NUTCH-3162, which addresses shortcomings in job-level latency percentiles (p50, p95, p99) for the Fetcher, ParseSegment, and Indexer by merging TDigest data from all map tasks and threads and writing counters in a single reducer (or a dedicated merge job for the Indexer). It should fix the cases where per-task counters were summed and percentiles were not merged.
This patch touches the following jobs:

- `LATENCY_KEY` records are sent to partition 0 so that one reducer merges all TDigests.
- `LOG.error` and driver-level `ErrorTracker` categorization is only run.

I think this fixes the issues. Arguably it is more complex than logging to a file and performing some ETL to extract metrics from the logs; however, this solution sticks with convention by keeping the metrics within the Hadoop ecosystem.
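The merge performed in the single reducer can be sketched as follows. Each task serializes its latency observations into a byte payload (standing in for a serialized T-Digest); the one reducer that receives every `LATENCY_KEY` record deserializes all payloads, merges them, and derives job-level percentiles. This is a simplified, stdlib-only illustration: the class and method names are hypothetical, and raw samples with nearest-rank percentiles stand in for the real TDigest centroids and `quantile(q)` calls.

```java
import java.io.*;
import java.util.*;

public class LatencyMergeSketch {

  // Serialize one task's latency samples (ms) -- a stand-in for
  // serializing a T-Digest into a BytesWritable payload.
  static byte[] serialize(double[] samples) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bos)) {
      out.writeInt(samples.length);
      for (double s : samples) out.writeDouble(s);
    }
    return bos.toByteArray();
  }

  // Merge all serialized payloads, as the single reducer receiving
  // every LATENCY_KEY record would, and return them sorted.
  static double[] mergeAll(List<byte[]> payloads) throws IOException {
    List<Double> merged = new ArrayList<>();
    for (byte[] p : payloads) {
      try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(p))) {
        int n = in.readInt();
        for (int i = 0; i < n; i++) merged.add(in.readDouble());
      }
    }
    double[] all = merged.stream().mapToDouble(Double::doubleValue).toArray();
    Arrays.sort(all);
    return all;
  }

  // Nearest-rank percentile over the merged samples; a real TDigest
  // answers quantile(q) from its centroids instead of raw values.
  static double percentile(double[] sorted, double q) {
    int idx = (int) Math.ceil(q * sorted.length) - 1;
    return sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
  }

  public static void main(String[] args) throws IOException {
    byte[] task1 = serialize(new double[] {10, 20, 30});
    byte[] task2 = serialize(new double[] {40, 50, 60, 70, 80, 90, 100});
    double[] merged = mergeAll(Arrays.asList(task1, task2));
    System.out.println("p50=" + percentile(merged, 0.50)
        + " p95=" + percentile(merged, 0.95)
        + " p99=" + percentile(merged, 0.99));
  }
}
```

The key point mirrored here is that percentiles are computed once over the merged digest rather than summed per task, which is exactly what summing per-task counters gets wrong.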
Finally, the PR is complemented with unit tests. This allowed me to think more about how we can add metrics validation in Nutch, but that will come in a separate issue/PR under NUTCH-3131.
Thanks for any review.